Friday, June 19, 2009

advance database designing

I began to discuss some general best practices that I have found to be particularly helpful. Again, none of it is specific to any one vendor's product and everything should, therefore, be applicable, regardless of which database implementation is being used. In this concluding article, I'll attempt to give an approachable introduction to the topic of database normalization and to the five Normal Forms. I'll also discuss other possible uses for a database in a project, such as a repository for configurational or logging data, for example.

Normalization

No discussion of relational database (DB) design is complete without a section on normalization. A normalized DB schema avoids certain anomalies when inserting, updating, or deleting data and, therefore, helps to keep consistent data in the database.

However, the absence of anomalies is only the tangible result of a deeper benefit of normalization -- namely the correct identification and modeling of entities. The insert, update, and delete anomalies I've just referred to are the consequences of the redundancy introduced by improper or inadequate separation between distinct entities. The normalization procedure is, therefore, not just a technical chore to be done out of principle, but it can actively help to improve the understanding of the business domain.

Regrettably, the treatment of normalization is often prohibitively formal, and it suffers from a special, rather non-intuitive terminology. This is unfortunate since the outcome of a normalization procedure often evokes the reaction that it all is nothing more than common sense. I will try to offer explanations of expressions that you are likely to encounter in the literature as they come up in the following discussion.

Overview

Normalization is a process in which an initial DB design is transformed, or decomposed, into a different, but equivalent, design. The resulting schema is equivalent to the original one in the sense that no information is lost when going from one to the other.

The normalization procedure consists of a sequence of projections -- that is, some attributes are extracted from one table to form a new one. In other words, tables are split up vertically. The decomposition is lossless, only if you can restore the original table by joining its projections.

Through such non-loss decompositions it is possible to transform an original schema into a resulting one that satisfies certain conditions, known as Normal Forms:

The First Normal Form (1NF) addresses the structure of an isolated table.
The Second (2NF), Third (3NF), and Boyce-Codd (BCNF) Normal Forms address one-to-one and one-to-many relationships.
The Fourth (4NF) and Fifth (5NF) Normal Forms deal with many-to-many relationships.
These Normal Forms form a hierarchy in such a way that a schema in a higher normal form automatically fulfills all the criteria for all of the lower Normal Forms.

The Fifth Normal Form is the ultimate normal form with respect to projections and joins -- it is guaranteed to be free of anomalies that can be eliminated by taking projections.

In the following discussion, any mention of keys refers to the conceptual keys formed from business data, not to any plainly technical surrogate keys which might have been defined.

First Normal Form

A table is said to be in First Normal Form (1NF), if all entries in it are scalar-valued. Relational database tables are 1NF by construction since vector-valued entries are forbidden. Vector-valued data (that is, entries which have more than one value in each row) are referred to as repeating groups.

The following relation violates 1NF because the SupplierID forms a repeating group (here and in the following examples and text, primary key fields are in bold):

{ PartID, Supplier1ID, Supplier2ID, Supplier3ID }

Repeating groups indicate a one-to-many relationship -- in other words, a relationship which in relational databases is treated using foreign keys. Note that the problem of repeating groups cannot be solved by adding any number of fields to a record; even if the number of elements of the vector-valued data was fixed, finite, and predetermined, searching for a value in all these parallel fields is prohibitively cumbersome.

To achieve 1NF, eliminate repeating groups by creating separate tables for each set of related data.

To demonstrate the typical anomalies that occur in tables that are only 1NF, consider the following example:

{ CustomerID, OrderID, CustomerAddress, OrderDate }

Note the following problems:

Insert: It is not possible to add a record for a customer who has never placed an order.
Update: To change the address for a customer, this change has to be repeated for all of the customer's existing orders.
Delete: Deleting the last order for a customer loses all information about the customer.
Functional dependency

The Second and Third Normal Forms address dependencies among attributes, specifically between key and non-key fields.

By definition, a key uniquely determines a record: Knowing the key determines the values of all the other attributes in the table row, so that given a key, the values of all the other attributes in the row are fixed.

This kind of relationship can be formalized as follows. Let X and Y be attributes (or sets of attributes) of a given relationship. Then Y is functionally dependent on X if, whenever two records agree on their X-values, they must also agree on their Y-values. In this case, X is called the determinant and Y is called the dependent. Since for any X there must be a single Y, this relationship represents a single-valued functional dependency. If the set of attributes in the determinant is the smallest possible (in the sense that after dropping one or more of the attributes from X, the remaining set of attributes does no longer uniquely determine Y), then the dependency is called irreducible.

Note that functional dependency is a semantic relationship: It is the business logic of the problem domain, represented by the relation, which determines whether a certain X determines Y.

Second Normal Form

A table is in Second Normal Form (2NF) if every non-key field is a fact about the entire key. In other words, a table is 2NF if it is 1NF and all non-key attributes are functionally dependent on the entire primary key (that is, the dependency is irreducible).

Clearly, 2NF is only relevant when the key is composite (that is, consisting of several fields). The following example describes a table which is not 2NF since the WarehouseAddress attribute depends only on WarehouseID but not on PartID:

{ PartID, WarehouseID, Quantity, WarehouseAddress }

To achieve 2NF, create separate tables for sets of values that apply to multiple records and relate these tables through foreign keys. The determinants of the initial table become the primary keys of the resulting tables.

Third Normal Form

A relation is in Third Normal Form (3NF) if it is 2NF and none of its attributes is a fact about another non-key field. In other words, no non-key field functionally depends on any other non-key field. (Such indirect dependencies are known as transitive dependencies.)

The following example violates 3NF since the Location is functionally dependent on the DepartmentID:

{ EmployeeID, DepartmentID, Location }

To achieve 3NF, eliminate fields that do not depend on the key from the original table and add them to the table whose primary key is their determinant.

To summarize the normalization procedure up to and including Third Normal Form:

Every field in a record must depend on The Key (1NF), the Whole Key (2NF), and Nothing But The Key (3NF).
Boyce-Codd Normal Form

Boyce-Codd Normal Form (BCNF) is an extension of 3NF in the case with two or more candidate keys which are composite and overlapping (that is, they have at least one field in common). If these conditions are not fulfilled, 3NF and BCNF are equivalent. A table is BCNF if, and only if its only determinants are candidate keys.

In the following table, both {SupplierID, PartID}, as well as {SupplierName, PartID}, are candidate keys. The table is not BCNF since it contains two determinants (SupplierID and SupplierName) which are not candidate keys. (SupplierID and SupplierName are determinants, since they determine each other.)

{ SupplierID, PartID, SupplierName, Quantity }

However, either of the following decompositions is BCNF:

{ SupplierID, SupplierName }
{ SupplierID, PartID, Quantity }

or

{ SupplierName, SupplierID }
{ SupplierName, PartID, Quantity }

To achieve BCNF, remove the determinants which are not candidate keys.

Many-to-many relationships and higher Normal Forms

Fourth and Fifth Normal Forms apply to situations involving many-to-many relationships. In relational databases, many-to-many relationships are expressed through cross-reference tables.

As an example, consider a case of class enrollment. Each student can be enrolled in one or more classes and each class can contain one or more students. Clearly, there is a many-to-many relationship between classes and students. This relationship can be represented by a Student/Class cross-reference table:

{ StudentID, ClassID }

The key for this table is the combination of StudentID and ClassID. To avoid violation of 2NF, all other information about each student and each class is stored in separate Student and Class tables, respectively.

Note that each StudentID determines not a unique ClassID, but a well-defined, finite set of values. This kind of behavior is referred to as multi-valued dependency of ClassID on StudentID.

Fourth Normal Form

A table is in Fourth Normal Form (4NF) if it is 3NF and it does not represent two or more independent many-to-many relationships.

Consider an example with two many-to-many relationships, between students and classes and between classes and teachers. Also, a many-to-many relationship between students and teachers is implied. However, the business rules do not constrain this relationship in any way -- the combination of StudentID and TeacherID does not contain any additional information beyond the information implied by the student/class and class/teacher relationships. Consequentially, the student/class and class/teacher relationships are independent of each other -- these relationships have no additional constraints. The following table is, then, in violation of 4NF:

{ StudentID, ClassID, TeacherID }

As an example of the anomalies that can occur, realize that it is not possible to add a new class taught by some teacher without adding at least one student who is enrolled in this class.

To achieve 4NF, represent each independent many-to-many relationship through its own cross-reference table.

Fifth Normal Form

A table is in Fifth Normal Form (5NF) if it is 4NF and its information content cannot be reconstructed from several tables containing fewer attributes.

Consider again the student/class/teacher example, but now assume that there is an additional relationship between students and teachers. The previous example table is now 4NF, since all the relationships it describes are interrelated. However, it is not 5NF, since it can be reconstructed from three cross-reference tables, each representing one of the three many-to-many relationships:

{ StudentID, ClassID }
{ ClassID, TeacherID }
{ TeacherID, StudentID }

To achieve 5NF, isolate interrelated many-to-many relationships, introducing the required number of new tables to represent all business domain constraints.

Normalization in context

In practice, many databases are de-normalized to greater or lesser degree. The reason most often stated has to do with performance -- a de-normalized database may require fewer joins and can, therefore, be faster for retrievals.

While this reasoning may be true, the usual caveats against premature optimization apply here as well as everywhere else. First, you should determine sufficiently that a performance problem exists and that the proposed de-normalization improves it before introducing a conceptually suboptimal design.

Furthermore, a de-normalized schema can be harder to update. The additional integrity checks that are necessary in this case may offset the performance gains for queries obtained through denormalization.

Finally, it should be noted that dealing with many-to-many relationships raises some issues that cannot be fully resolved through normalization (Chris Date's article, "Normalization is no Panacea," in Resources covers this topic).

--------------------------------------------------------------------------------
Back to top
History tables and event logging

Besides holding the data that is necessary to support the primary business purpose of the system under construction, the DB is also a possible location to record information that is useful primarily for internal technical purposes, such as adminstration and maintenance of the system itself.

History tables

In a production system, you may desire to preserve the history of changes to the data in the live database. This can be achieved through the use of history (or backup) tables, and the appropriate INSERT, DELETE, and UPDATE triggers.

Each table in the DB should have a history table, mirroring the entire history of the primary table. If entries in the primary table are to be updated, the old contents of the record are first copied to the history table before the update is made. In the same way, deleted records in the primary table are copied to the history table before being deleted from the primary one. The history tables always have the name of the corresponding primary one, but with _Hist appended.

Entries to the history table are always appended at the end. The history table, therefore, grows strictly monotonically in time. It will become necessary to periodically spool ancient records to tape for archiving. Such records may, as a result, not be immediately available for recall.

The attributes of the history table should agree exactly with the attributes of the primary table. In addition, the history table records the date and type of the change to the primary table. The type is one of the following: Create, Update, or Delete.

Changes to the structure of the primary table affect the history table. When an attribute is added to the primary table, it is added to the history table as well. When an attribute is deleted from the primary table, the corresponding attribute is not deleted from the history table. Instead, this field is left blank (NULL) in all future records. Consequentially, the history table not only grows in length over time, but also in width.

Note that the choice to use such a history mechanism affects neither the overall DB layout, nor applications that access only the primary tables. During development, you can probably dispense with recording changes in this way and leave the creation of the history tables and the necessary triggers until installation time.

Event logging for fun and profit

A database can be used as an event logger. The notion of event is broad, ranging from common debugging and system specific runtime information, to events which are specific to the business domain. Possible candidates for events to be logged to the database include:

Transactions making changes to persistent data
Transactions crossing component boundaries
Errors and exceptions
Dispatching of messages to the user
Events involving financial transactions
State changes to business entities
An EventLog table to log such information contains at least these fields to record:

Timestamp
EventType (a type code)
Details (a descriptive string)
Optionally, it may identify an owner or originator of the event. The owner concept can either identify a logged-in user or admin, but it may as well describe a part or module of the system itself. In applications dealing with financial transactions, additional (optional) fields identifying the from- and to-accounts can be useful.

System config tables

Finally, it is possible to use the database as centralized storage for configurational data. Usually this information is kept distributed in miscellaneous plain-text files, such as start-up scripts or property files. The database can provide a single, managed storage facility for such information.

Besides start-up parameters, which are usually supplied to the system at boot-time, one may also think of properties that are required at runtime, such as localized strings and messages.

Lastly, the database is a possible place to keep system documentation. This is most useful, of course, for information that is naturally in tabular form (rather than free text), such as lists of assigned port numbers or shared memory keys, for instance. But this approach is not limited to codes. A data dictionary, defining the permissible values for each field, is a necessity on any non-trivial project. This also can be made accessible to all developers and administrators by storing it in the database.

In any case, the data is stored in simple key/value pairs. Additional table attributes can contain comments or pointers (URLs) to relevant offline documentation.

The primary advantage to keeping such information in the database is that the database provides a central repository for all relevant information, as opposed to the typical approach in which data is scattered over miscellaneous files.

--------------------------------------------------------------------------------
Back to top
Summary

In this article I've covered database normalization and the five Normal Forms. In the normalization process, an original database design is transformed into an equivalent one, which avoids certain anomalies when inserting, updating, or deleting records. Proper normalization also helps to identify entities correctly. I also discussed the possible use of a database as a central repository for logging information or configurational data.

Resources

Check out Barry Wise's "Database Normalization and Design Techniques"which gives a simple and accessible introduction to the Normal Forms and the normalization process.

Read William Kent's "A Simple Guide to Five Normal Forms in Relational Database Theory" for a thorough primer on the five Normal Forms.

Look at C.J. Date's article, "Normalization is no Panacea," which discusses some problems that cannot be solved by normalization alone, since (in his own words): "It's always important to understand the limitations of any technology on which we rely heavily."

Avail yourself of this white paper on Simple Conditions for Guaranteeing Higher Normal Forms in Relational Databases.

See the results of last year's survey on Winning Database Configurations.

Basic DataBase Designing

In this series, I discuss some general best practices that I have found to be particularly helpful. Nothing in it is specific to any one vendor's product and everything should, therefore, be applicable, regardless of which database implementation is being used. In this frist part of the article, I want to talk about both simple and complex datatypes, and about primary and foreign keys, which are the plumbing that holds the entire database together.

Primary keys and related matters

A relational database (DB) stores two kinds of information -- data and plumbing. Data comprises the customer names, inventory numbers, item descriptions, and so on, that the application uses. Plumbing refers to the primary and foreign keys that the DB needs to find database records and relate them to one another.

Basic plumbing

For the purpose of data modeling, the plumbing should be largely transparent. In fact, purist DB lore makes no distinction between data and plumbing. However, you will see that it is more efficient for administration and maintenance, as well as in terms of runtime performance, to have some additional fields to serve as DB keys.

Every table must have a primary key: an attribute or combination of attributes that are guaranteed to be unique and not-null. It is generally helpful to introduce a surrogate key -- a table attribute which has no business meaning, but simply serves as unique identifier for each record in the table. This is the plumbing that I have been referring to.

The requirements for a primary key are very strict. It must:

Exist
Be unique
Not change over time
Surrogate keys help to mitigate the fact that real business data never reliably fulfills these requirements. Not every person has a Social Security Number (think of those outside the U.S.), people change their names, and other important information.

Business data might also simply be bad -- glitches in the Social Security Administration's system may lead to different persons having the same Social Security Number. A surrogate key helps to isolate the system from such problems.

The second reason that surrogate keys are favorable has to do with efficiency and ease of maintenance, since you can choose the most efficient datatype for the surrogate key. Furthermore, the surrogate key typically is a single field (not a compound key), which simplifies the schema (particularly when the key is used in other tables as a foreign key).

Every table should have a dedicated column to serve as this table's primary key. This column may be called id or pk (or possibly _id or _pk). Most databases are tuned for queries on integers, so it makes sense to use this datatype as primary key. Many databases, including Postgres and Oracle, also provide a special serial or sequence integer type, which generates a sequence of unique integers. Declaring a column to be of this type guarantees that a unique key is generated for each inserted row.

Foreign keys are table attributes, the values of which are the primary keys of another table. It often makes sense to label foreign key columns explicitly, for instance, by adopting a naming convention such as _fk. A referential integrity constraint (references) should be declared as part of the CREATE statement when creating the table.

It bears repeating that the surrogate keys discussed earlier are part of the plumbing only -- their existence does not obviate the modeling requirement to be able to form a primary key from the business data alone. Such a business data candidate key is a subset of all attributes, the values of which are never null, and each combination of values is unique. As a check on correct data modeling, such a candidate key must exist and should be documented for every table.

Strictly speaking, you may not always find a candidate key among the business data. Imagine a table recording the first and last name for each user, but having no further attributes. Now assume that there are two different persons, both of whom have the first name "Joe" and last name "Blow." In such a case, there exists no combination of table attributes that can form a suitable candidate key.

The underlying problem here is whether you are talking about the uniqueness of datasets or about the uniqueness of the underlying entities -- users, in this example. It is generally more intuitive, in particular to developers used to object-oriented analysis, to model the uniqueness of the underlying entities. Surrogate keys as discussed earlier can help to achieve this.

Alternate keys and visible identifiers

As part of the plumbing, the surrogate key has no need to ever be visible outside the DB. In particular, it should never be revealed to the user. This allows the DB administrator to change the representation of the keys at will if necessary. If a business need arises for providing the user with a unique identifier to a particular dataset, this identifier should be considered real business data and kept separate from the plumbing. For instance, an additional column called VisibleAccountNumber or the like can be introduced. Of course, this attribute should be non-null and unique so that it forms an alternative candidate key (an alternate key). Having a separate column for visible identifiers also makes it possible to generate and format the values for this attribute in a user-friendly way so that it is easy to read over the phone to a customer support person, for instance.

A borderline case is when the identifier is not directly visible, but may still be accessible to the user. Examples include hidden fields in Web pages in which an identifier is shuttled to the client to be used as a parameter in the following request. Although there is no need for the user to handle the identifier, a malicious user may read and attempt to spoof it. Using the numerical values of a primary key directly, in principle, allows any attacker to walk the entire table!

Defences against this problem include either encrypting and decrypting the value of the primary key, or protecting the key by appending a Message Authentication Code (MAC). An alternative is to use a hard-to-spoof visible identifier attribute for the table, such as the hash of the record's primary key or creation timestamp. (Of course, the uniqueness of this attribute must be assured.)

Whether the key is visible to the application (as opposed to the end user) depends on the specifics of the project. Using a numeric type directly carries the key's database representation straight into the application code and should be avoided to prevent coupling. In small-scale developments, a String representation of the key's value may be acceptable (all datatypes that can be stored in a DB must be able to be serialized).

But a better solution is a simple wrapper object that adds very little complexity, but provides strong decoupling of the database keys' representation from their interfaces. A danger exists in making the wrapper object too smart. The intention with surrogate keys is to make them simple and efficient for the database to handle. Settings from a database value and possibly from a String, comparing with another key object, and possibly serializing are all the methods that are required. Smarts, such as the ability to verify the contents based on a checksum calculation, suggest that this object probably belongs to the business data domain (like the visible record identifiers, introduced earlier).

The problem of the Universally Unique Identifier

A final consideration concerns the possible need for a Universally Unique Identifier (UUID). The short answer is that relational databases do not require UUIDs at all. In fact, the entire UUID concept is somewhat unrelated to relational database management. Relational database keys -- the plumbing -- need only be unique per table, which can be achieved by using an auto-incrementing datatype such as the serial type mentioned earlier.

UUIDs can have some technical difficulties. To ensure uniqueness, all UUIDs must be generated by a centralized service -- which leads to scalability problems and can become a single point of failure. (The scalability issue can be mitigated by a stratified approach in which a central master is used to give out seeds to several slaves, which in turn generate the final identifiers in batches, and so on.) To represent a UUID in a database, use either a string attribute or a compound key comprising several integer columns. Both approaches are significantly slower than operations based on keys made up of long integers. Compound keys also increase the complexity of the DB schema when used as foreign keys.

In the end, whether or not a record in a database needs to have a truly globally unique ID is dictated by the business rules, not the database architecture. Some records may already contain some form of UUID (merchandise items, for instance, typically possess a Universal Product Code as barcode). Some other records, possibly corresponding to principal business entities, may otherwise already contain a unique identifier as part of their business data (such as the combination of timestamp and account name for a ledger entry). If this is not the case, a UUID can be generated and stored alongside the business data for those records that require it. In any case, UUIDs should be considered part of the business data -- not of the plumbing.

Even if (and this is a big if) the object-relational mapping approach chosen requires every business object to have a persistent, unique ID, there is no need to base the internal workings of the underlying relational database engine on this fact.

In summary, I argue to keep business data apart from the database's internal plumbing. Building a relational database around UUIDs breaks this principle by using attributes, which, in the end, are really part of the business data, as internal infrastructure. (For a totally different point of view on this issue and a careful discussion of the problems involved in generating UUIDs in a scalable fashion, see Scott Ambler's paper, "Mapping objects to relational databases," in Resources.)

--------------------------------------------------------------------------------
Back to top
Datatypes

The SQL standard defines a number of standard datatypes and most DB vendors support some additional ones which are specific to their own product. In the absence of truly compelling reasons to the contrary, avoid such extensions for the sake of portability.

Strings and numbers

In general, numerical types pose few problems -- just select one that is large enough to support the necessary range of values.

The attempt to find the optimal width of a string column is usually not worth the effort. You can avoid a lot of confusion later on by making all text messages of type varchar(n) and limiting yourself to a few standard string lengths and introducing aliases for them, such as: 32 bytes ("Label"), 256 bytes ("Note"), and 4k ("Text").

Even if other business requirements restrict the maximum length of certain fields to specific values, the DB schema is arguably not the best place to enforce these rules. By the time the data reaches the DB, it is too late to do anything about it (except reject it). Individual restrictions, stemming from business rules and requirements, should be enforced by the business logic layer, which handles user interaction and input validation. On the other hand, maintenance of the DB schema is considerably simplified if it is restricted to a handful of different string attributes.

Limit the use of fixed-width strings to codes of all sorts (as opposed to variable-length strings for real text). Keep in mind however, that many seemingly fixed-length codes do actually become wider over time. The prudent DB designer tries to avoid anything similar to the Y2K problem for new development work.

Time and money

A type to record timestamps (date/time combinations) is always necessary and is, fortunately, covered by the SQL standard. No fully satisfying way exists to record a monetary value, however.

Saving monetary values and treating them in program code as a floating-point values always leads to round-off errors. Recording the value as an exact integer of the smallest currency subdivision (such as "cent" for US dollars, as well as for Euros and other appropriate currencies) may not be sufficient either. Many values carry more digits behind the decimal point than the two for which actual coins exist (just visit your local gas station). A choice of decimal with 5 to 9 digits should work, though.

It goes without saying that no monetary value should ever be recorded without also recording the currency -- even if you think your application will never handle anything but US dollars. Consider setting up a currency table and relating it to the monetary values using foreign keys rather than embedding currency information directly. This helps with internationalization (different currency names and symbols), as well as with formatting issues.

Booleans and the System Types Table

The use of the type bool anywhere in the design is a hint to rethink this particular module again. Very few attributes are truly limited to only two values -- even a gender column has a malicious tendency to morph towards (at least) three states -- male, female, and unknown. Allowing nulls only masks the real problem. A more flexible approach to type codes is needed.

In many places in the database, attributes determine a record's type in some way. The GenderType mentioned above is such an instance. Other examples may include ItemType (such as Merchandise, Freight, Packaging, Insurance), PaymentType (Cash, Check, MoneyOrder, CreditCard, Barter), and things like StoreType, MembershipType, DeviceType, and many more. This also includes instances in which you want to store the type of an object in some applicable object model.

For each type, you need some form of documentation telling you not only the name of the type, but also the characteristics associated with it; for instance, you may want to know what permissions each UserType entails. What better place to keep this information than in the database itself?

Any record that has some form of type information associated with it should contain a type code column, which itself is a foreign key referencing a type code table. The foreign key constraint ensures that no records with invalid types exist. The type code table might have the following attributes:

typeCode_pk
label (unique mnemonic, such as varchar(32))
description (varchar(256) should be sufficient)
uri (pointing to additional resources, where necessary)
codeGroup_fk
Additional attributes are, of course, conceivable -- such as a three-letter code or a visible numeric code.

The codeGroup_fk attribute serves to organize related type codes. For instance, all subscriber types might form a group. The codeGroup_fk attribute is a foreign key into a separate code group table. However, realizing that a code group is nothing but a type code itself, the relationship can be made recursive so that codeGroup_fk references typeCode_pk. Not only does this make the additional type code table unnecessary, but it also makes it possible to order groups in an arbitrarily deep hierarchical structure. It is best to keep the type of code system relatively simple and straightforward.

Complex datatypes

Finally, there are some common but complex datatypes -- such as phone numbers, postal addresses, contact information, and credit cards -- that occur in almost every database schema. Typically, such records need to be accessed from many tables in the database. In a typical eCommerce system, for instance, it might be necessary to store contact information for users, suppliers, warehouses, and admins.

Rather than including those attributes in the respective user, supplier, or other record. (and thereby repeating those columns throughout the database), it makes sense to set up a single table for the contact information that is referenced through foreign keys by all other tables. This has two immediate benefits:

It is easier to later change the cardinality of the relationships.
It localizes any future changes to the complex datatype.
Anticipating the attributes that will likely be required for each of those complex types is something of an art. My suggestion is to try to strive for completeness from the outset rather than being forced to change the schema each time an additional field becomes necessary.

A sampling of possible attributes for postal addresses includes:

Department
Company
Mail Stop
Address Line 1
Address Line 2
Address Line 3
City
State
Postal Code ("Zip")
Country
Full contact information might include the following attributes:

Title
First Name
Middle Name (or Initial)
Last Name
Suffix (such as jr. or sr.)
Home address
Work address
Home Phone
Work Phone
Cell Phone
Fax
Pager
Email
Finally, phone numbers should never be considered flat numbers. In fact, they break down into the following fields:

CountryCode
AreaCode (Number Plan Area)
ExchangeCode (Prefix)
LineNumber (Suffix)
Extension
In a phone number such as 987-1234, the prefix is the 987 and the suffix is the 1234. The extension is the only part of the phone number that is optional. It is probably sufficient to use char(4) for all columns, but one might consider char(6) to be on the safe side. Note that area codes in the US are limited to three digits, but this is not true for other countries.

Sensitive data should be kept in encrypted form. Even if the database system itself is compromised, the data is still protected from misuse. The most famous example of this kind of data management is the Unix password system which only stores hashes of the users' passwords rather than the passwords themselves. Some data, such as credit card numbers, needs to be encrypted in a recoverable fashion; however, a one-way encryption (as for the Unix password file) will not do. This leads to the problem of encryption key management -- clearly, it should not be stored in the DB, along with the secrets, but supplied at boot-time, for instance.

--------------------------------------------------------------------------------
Back to top
Summary

In this article, I've discussed some general best practices when designing relational databases, including:

The benefits of maintaining additional table attributes without any business meaning to serve as surrogate keys.
The recommendation not to base the internal workings of the database on Universally Unique Identifiers.
The use of a centralized type codes facility to express attributes with finite, predetermined ranges of values.
Some considerations in designing complex datatypes to be used throughout the database schema.
In the next half of this article, I'll cover database normalization and some additional uses for a database within a project, like the use of history tables and event logs.

Resources

Learn about the importance of the surrogate primary key when designing stable, flexible, and well-performing enterprise databases in Mike Lonigro's article, "The Case for the Surrogate Key."

Read the book, Translucent Databases , as Peter Wayner offers a better, deeper protection paradigm for databases, one that works by scrambling the data with encryption algorithms, using the minimal amount of encryption that ensures the database is still highly functional.

Search developerWorks' vast library for more than 1,000 articles on database design.

Discover how DB2 Relational Connect helps with information integration by allowing several databases to be viewed and manipulated as one.

Tuesday, June 16, 2009

Search engine optimization

I. Introduction – What Is SEO

Search Engine Optimization (SEO) is often considered the more technical part of Web marketing. This is true because SEO does help in the promotion of sites and at the same time it requires some technical knowledge – at least familiarity with basic HTML. SEO is sometimes also called SEO copyrighting because most of the techniques that are used to promote sites in search engines deal with text. Generally, SEO can be defined as the activity of optimizing Web pages or whole sites in order to make them more search engine-friendly, thus getting higher positions in search results.

One of the basic truths in SEO is that even if you do all the things that are necessary to do, this does not automatically guarantee you top ratings but if you neglect basic rules, this certainly will not go unnoticed. Also, if you set realistic goals – i.e to get into the top 30 results in Google for a particular keyword, rather than be the number one for 10 keywords in 5 search engines, you will feel happier and more satisfied with your results.

Although SEO helps to increase the traffic to one's site, SEO is not advertising. Of course, you can be included in paid search results for given keywords but basically the idea behind the SEO techniques is to get top placement because your site is relevant to a particular search term, not because you pay.

SEO can be a 30-minute job or a permanent activity. Sometimes it is enough to do some generic SEO in order to get high in search engines – for instance, if you are a leader for rare keywords, then you do not have a lot to do in order to get decent placement. But in most cases, if you really want to be at the top, you need to pay special attention to SEO and devote significant amounts of time and effort to it. Even if you plan to do some basic SEO, it is essential that you understand how search engines work and which items are most important in SEO.

How Search Engines Work

The first basic truth you need to learn about SEO is that search engines are not humans. While this might be obvious for everybody, the differences between how humans and search engines view web pages aren't. Unlike humans, search engines are text-driven. Although technology advances rapidly, search engines are far from intelligent creatures that can feel the beauty of a cool design or enjoy the sounds and movement in movies. Instead, search engines crawl the Web, looking at particular site items (mainly text) to get an idea what a site is about. This brief explanation is not the most precise because as we will see next, search engines perform several activities in order to deliver search results – crawling, indexing, processing, calculating relevancy, and retrieving.

First, search engines crawl the Web to see what is there. This task is performed by e piece of software, called a crawler or a spider (or Googlebot, as is the case with Google). Spiders follow links from one page to another and index everything they find on their way. Having in mind the number of pages on the Web (over 20 billion), it is impossible for a spider to visit a site daily just to see if a new page has appeared or if an existing page has been modified. Sometimes crawlers will not visit your site for a month or two, so during this time your SEO efforts will not be rewarded. But there is nothing you can do about it, so just keep quiet.

What you can do is to check what a crawler sees from your site. As already mentioned, crawlers are not humans and they do not see images, Flash movies, JavaScript, frames, password-protected pages and directories, so if you have tons of these on your site, you'd better run the Spider Simulator below to see if these goodies are viewable by the spider. If they are not viewable, they will not be spidered, not indexed, not processed, etc. - in a word they will be non-existent for search engines.

After a page is crawled, the next step is to index its content. The indexed page is stored in a giant database, from where it can later be retrieved. Essentially, the process of indexing is identifying the words and expressions that best describe the page and assigning the page to particular keywords. For a human it will not be possible to process such amounts of information but generally search engines deal just fine with this task. Sometimes they might not get the meaning of a page right but if you help them by optimizing it, it will be easier for them to classify your pages correctly and for you – to get higher rankings.

When a search request comes, the search engine processes it – i.e. it compares the search string in the search request with the indexed pages in the database. Since it is likely that more than one pages (practically it is millions of pages) contains the search string, the search engine starts calculating the relevancy of each of the pages in its index to the search string.

There are various algorithms to calculate relevancy. Each of these algorithms has different relative weights for common factors like keyword density, links, or metatags. That is why different search engines give different search results pages for the same search string. What is more, it is a known fact that all major search engines, like Yahoo!, Google, MSN, etc. periodically change their algorithms and if you want to keep at the top, you also need to adapt your pages to the latest changes. This is one reason (the other is your competitors) to devote permanent efforts to SEO, if you'd like to be at the top.

The last step in search engines' activity is retrieving the results. Basically, it is nothing more than simply displaying them in the browser – i.e. the endless pages of search results that are sorted from the most relevant to the least relevant sites.

Differences Between the Major Search Engines

Although the basic principle of operation of all search engines is the same, the minor differences between them lead to major changes in results relevancy. For different search engines different factors are important. There were times, when SEO experts joked that the algorithms of Yahoo! are intentionally made just the opposite of those of Google. While this might have a grain of truth, it is a matter a fact that the major search engines like different stuff and if you plan to conquer more than one of them, you need to optimize carefully.

There are many examples of the differences between search engines. For instance, for Yahoo! and MSN, on-page keyword factors are of primary importance, while for Google links are very, very important. Also, for Google sites are like wine – the older, the better, while Yahoo! generally has no expressed preference towards sites and domains with tradition (i.e. older ones). Thus you might need more time till your site gets mature to be admitted to the top in Google, than in Yahoo!.

Keywords – the Most Important Item in SEO

Keywords are the most important SEO item for every search engine – actually they are what search strings are matched against. So you see that it is very important that you optimize your site for the right keywords. This seems easy at first but when you get into more detail, it might be a bit confusing to correctly determine the keywords. But with a little research and thinking the problem of selecting the right keywords to optimize for can be solved.

1. Choosing the Right Keywords to Optimize For

It seems that the time when you could easily top the results for a one-word search string is centuries ago. Now, when the Web is so densely populated with sites, it is next to impossible to achieve constant top ratings for a one-word search string. Achieving constant top ratings for two-word or three-word search strings is a more realistic goal. If you examine closely the dynamics of search results for popular one-word keywords, you might notice that it is so easy one week to be in the first ten results and the next one– to have fallen out of the first 30 results because the competition for popular one-word keywords is so fierce and other sites have replaced you.

Of course, you can include one-word strings in your keywords list but if they are not backed up by more expressions, do not dream of high ratings. For instance, if you have a site about dogs, “dog” is a mandatory keyword but if you do not optimize for more words, like “dog owners”, “dog breeds”, “dog food”, or even “canine”, success is unlikely, especially for such a popular keyword. The examples given here are by no means the ultimate truth about how to optimize a dog site but they are good enough to show that you need to think broad when choosing the keywords.

Generally, when you start optimization, the first thing you need to consider is the keywords that describe the content of your site best and that are most likely to be used by users to find you. Ideally, you know your users well and can guess correctly what search strings they are likely to use to search for you. One issue to consider is synonyms. Very often users will use a different word for the same thing. For instance, in the example with the dog site, “canine” is a synonym and it is for sure that there will be users who will use it, so it does not hurt to include it now and then on your pages. But do not rush to optimize for every synonym you can think of – search engines themselves have algorithms that include synonyms in the keyword match, especially for languages like English.

Instead, think of more keywords that are likely to be used to describe your site. Thinking thematically is especially good because search engines tend to rate a page higher if it belongs to a site the theme of which fits into the keyword string. In this aspect it is important that your site is concentrated around a particular theme – i.e. dogs. It might be difficult to think of all the relevant keywords on your own but that is why tools are for. For instance, the Website Keyword Suggestions Tool below can help you to see how search engines determine the theme of your web site and what keywords fit into this theme. You can also try Google's Keyword Tool to get more suggestions about which keywords are hot and which are not.

When choosing the keywords to optimize for, you need to consider not only their relevancy to your site and the expected monthly number of searches for these particular keywords. Very often narrow searches are more valuable because the users that come to your site are those that are really interested in your product. If we go on with the dog example, you might discover that the “adopt a dog” keyphrase brings you more visitors because you have a special section on your site where you give advice on what to look for when adopting a dog. This page is not of interest of current dog owners but to potential dog owners only, who might be not so many in number but are your target audience and the overall effect of attracting this niche can be better than attracting everybody who is interested in dogs in general. So, when you look at the numbers of search hits per month, consider the unique hits that fit into the theme of your site.

2. Keyword Density

After you have chosen the keywords that describe your site and are supposedly of interest to your users, the next step is to make your site keyword-rich and to have good keyword density for your target keywords. Keyword density is a common measure of how relevant a page is. Generally, the idea is that the higher the keyword density, the more relevant to the search string a page is. The recommended density is 3-7% for the major 2 or 3 keywords and 1-2% for minor keywords. Try the Keyword Density Checker below to determine the keyword density of your website.

Although there are no strict rules, try optimizing for a reasonable number of keywords – 5 or 10 is OK. If you attempt to optimize for a list of 300, you will soon see that it is just not possible to have a good keyword density for more than a few keywords, without making the text sound artificial and stuffed with keywords. And what is worse, there are severe penalties (including ban from the search engine) for keyword stuffing because this is considered an unethical practice that tries to manipulate search results.

3. Keywords in Special Places

Keywords are very important not only as quantity but as quality as well – i.e. if you have more keywords in the page title, the headings, the first paragraphs – this counts more that if you have many keywords at the bottom of the page. The reason is that the URL (and especially the domain name), file names and directory names, the page title, the headings for the separate sections are more important than ordinary text on the page and therefore, all equal, if you have the same keyword density as your competitors but you have keywords in the URL, this will boost your ranking incredibly, especially with Yahoo!.

a. Keywords in URLs and File Names

The domain name and the whole URL of a site tell a lot about it. The presumption is that if your site is about dogs, you will have “dog”, “dogs”, or “puppy” as part of your domain name. For instance, if your site is mainly about adopting dogs, it is much better to name your dog site “dog-adopt.net” than “animal-care.org”, for example, because in the first case you have two major keywords in the URL, while in the second one you have no more than one potential minor keyword.

When hunting for keyword rich domain names, don't get greedy. While from a SEO point of view it is better to have 5 keywords in the URL, just imagine how long and difficult to memorize the URL will be. So you need to strike a balance between the keywords in the URL and site usability, which says that more than 3 words in the URL is a way too much.
Probably you will not be able to come on your own with tons of good suggestions. Additionally, even if you manage to think of a couple of good domain names, they might be already taken. In such cases tools like the Tool below can come very handy.

File names and directory names are also important. Often search engines will give preference to pages that have a keyword in the file name. For instance http://mydomain.com/dog-adopt.html is not as good as http://dog-adopt.net/dog-adopt.html but is certainly better than http://mydomain.com/animal-care.html. The advantage of keywords in file names over keywords in URLs is that they are easier to change, if you decide to move to another niche, for example.

b. Keywords in Page Titles

The page title is another special place because the contents of the

tag usually gets displayed in most search engines, (including Google). While it is not mandatory per the HTML specification to write something in the

tag (i.e. you can leave it empty and the title bar of the browser will read “Untitled Document” or similar), for SEO purposes you may not want to leave the

tag empty; instead, you'd better write the the page title in it.

Unlike URLs, with page titles you can get wordy. If we go on with the dog example, the

tag of the home page for the http://dog-adopt.net can include something like this:

Adopt a Dog – Save a Life and Bring Joy to Your Home

Everything You Need to Know About Adopting a Dog

or even longer.

c. Keywords in Headings

Normally headings separate paragraphs into related subtopics and from a literary point of view, it may be pointless to have a heading after every other paragraph but from SEO point of view it is extremely good to have as many headings on a page as possible, especially if they have the keywords in them.

There are no technical length limits for the contents of the

tags but common sense says that too long headings are bad for page readability. So, like with URLs, you need to be wise with the length of headings. Another issue you need to consider is how the heading will be displayed. If it is Heading 1 , generally this means larger font size and in this case it is recommendable to have less than 7-8 words in the heading, otherwise it might spread on 2 or 3 lines, which is not good and if you can avoid it – do it.

III. Links – Another Important SEO Item

1. Why Links Are Important

Probably the word that associates best with Web is “links”. That is what hypertext is all about – you link to pages you like and get linked by pages that like your site. Actually, the Web is woven out of interconnected pages and spiders follow the links, when indexing the Web. If not many sites link to you, then it might take ages for search engines to find your site and even if they find you, it is unlikely that you will have high rankings because the quality and quantity of links is part of the algorithms of search engines for calculating relevancy.

2. Inbound and Outbound Links

Put in layman's terms, there are two types of links that are important for SEO – inbound and outbound links. Outbound links are links that start from your site and lead to another one, while inbound links, or backlinks, come from an external site to yours, e.g. if a.com links to mydomain.com, the link from a.com is an inbound link for mydomain.com.

Backlinks are very important because they are supposed to be a measure of the popularity of your site among the Web audience. It is necessary to say that not all backlinks are equal. There are good and bad backlinks. Good backlinks are from reputable places - preferably from sites with a similar theme. These links do boost search engine ranking. Bad backlinks come from suspicious places – like link farms – and are something to be avoided. Well, if you are backlinked without your knowledge and consent, maybe you should drop the Webmaster a line, asking him or her to remove the backlink.

If you are not heavily backlinked, don't worry - buying links is an established practice and if you are serious about getting to the top, you may need to consider it. But before doing this, you should consider some free alternatives. For instance, some of the good places where you can get quality backlinks are Web directories like http://dmoz.org or http://dir.yahoo.com.

First, look for suitable sites to backlink to you using the Backlinks Builder below. After you identify potential backlinks, it's time to contact the Web master of the site and to start negotiating terms. Sometimes you can agree to a barter deal – i.e. a link exchange – they will put on their site N links to your site and you will put on your site N links to their site - but have in mind that this is a bad, risky deal and you should always try to avoid it.

Internal links (i.e. links from one page to another page on the same site) are also important but not as much as backlinks. In this connection it is necessary to say, that using images for links might be prettier but it is a SEO killer. Instead of having buttons for links, use simple text links. Since search engines spider the text on a page, they can't see all the designer miracles, like gradient buttons or flash animations, so when possible, either avoid using them, or provide a meaningful textual description in the tag, as described next.

3. Anchor text

Anchor text is the most important item in a backlink. While it does matter where a link comes from (i.e. a reputable place or a link farm), what matters more is the actual text the link starts from. Put simply, anchor text is the word(s) that you click on to open the hyperlink – e.g. if we have the best search engine, than “the best search engine” is the anchor text for the hyperlink to google.com. You see that you might have a backlink from a valuable site but if the anchor text is something like “an example of a complete failure”, you will hardly be happy with it.

When you check your backlinks, always check what their anchor text is and if there is a keyword in it. It is a great SEO boost to have a lot of backlinks from quality sites and the anchor text to include our keywords. Check the anchor text of inbound backlinks is with the Backlink Anchor Text Analyzer tool below. Besides the anchor text itself, the text around it is also important.

4. Link Practices That Are To Be Avoided

Similar to keyword stuffing, purchasing links in bulk is a practice to be avoided. It gets suspicious if you bartered 1000 links with another site in a day or two. What is more, search engines keep track of link farms (sites that sell links in bulk) and since bought links are a way to manipulate search results, this practice gets punished by search engines. So avoid dealing with link farms because it can cause more harm than do good. Also, outbound links from your site to known Web spammers or “bad guys” are also to be avoided.

As mentioned, link exchange is not a clean deal. Even if it boosts your ranking, it can have many other negative aspects in the long run. First, you do not know if the other party will keep their promise – i.e. they might remove some of the links to you. Second, they might change the context the link appears into. Third, it is really suspicious if you seem to be “married” to another site and 50% or more of your inbound and outbound links are from/to this direction.

When links are concerned, one aspect to have in mind is the ratio between inbound and outbound links. Generally speaking, if your outbound links are ten times your inbound links, this is bad but it also varies on a case by case basis. If you have a site that links to news sources or has RSS feeds, then having many outbound links is the inevitable price of fresh content.

IV. Metatags

A couple of years ago tags were the primary tool for search engine optimization and there was a direct correlation between what you wrote there and your position in search results. However, algorithms got better and today the importance of metadata is decreasing day by day, especially with Google. But still some search engines show metadata (under the clickable link in search results), so users can read what you have written and if they think it is relevant, they might go to your site. Also, some of the specialized search engines still use the metatags when ranking your site.

The meta Description tag is are one more way for you to write a description of your site, thus pointing search engines to what themes and topics your Web site is relevant to. It does not hurt to include at least a brief description, so don't skip it. For instance, for the dog adoption site, the meta Description tag could be something like this:

A potential use of the meta Keywords tags is to include a list of keywords that you think are relevant to your pages. The major search engines will not take this into account but still it is a chance for you to emphasize your target keywords. You may consider including alternative spellings (or even common misspellings of your keywords) in the meta Keywords tag. For instance, if I were to write the meta keywords tag for the dog adoption site, I would do it like that: It is a small boost to search engine top ranking but why miss the chance?

The meta Robots tag deserves more attention. In this tag you specify the pages that you do NOT want crawled and indexed. It happens that on your site you have contents that you need to keep there but you don't want it indexed. Listing this pages in the meta Robots tag is one way to exclude them (the other way is by using a robots.txt file and generally this is the better way to do it) from being indexed.

V. Content Is King

If you are new to SEO, it might be a surprise for you that text is one of the driving forces to higher rankings. But it is a fact. Search engines (and your readers) love fresh content and providing them with regularly updated, relevant content is a recipe for success. Generally, when a site is frequently updated, this increases the probability that the spider will revisit the site sooner. You can't take for sure that if you update your site daily, the spider will visit it even once a week but if you do not update your contents regularly, this will certainly drop you to from the top of search results.

For company sites that are not focused on writing but on manufacturing constantly adding text can be a problem because generally company sites are not reading rooms or online magazines that update their content daily, weekly or monthly but even for company sites there are reasonable solutions. No matter what your business is, one is for sure – it is always relevant to include a news section on your site – it can be company news or RSS feeds but this will keep the ball rolling.

1. Topical Themes or How to Frequently Add Content to Your Site

If you are doing the SEO for an online magazine, you can consider yourself lucky – fresh content is coming all the time and you just need to occasionally arrange a heading or two or a couple of paragraphs to make the site SEO-friendly. But even if you are doing a SEO for an ordinary company site, it is not all that bad - there are ways to constantly get fresh content that fits into the topic of the site.

One of the intricacies of optimizing a company site is that it has to be serious. Also, if your content smells like advertising and has no practical value for your visitors, this content is not that valuable. For instance, if you are a trade company, you can have promotional texts about your products. But have in mind that these texts must be informational, not just sales hype. And if you have a lot of products to sell, or frequently get new products, or make periodical promotions of particular products and product groups – you can post all this to your site and you will have fresh, topical content.

Also, depending on what your business is about, you can include different kinds of self-updating information like lists of hot new products, featured products, discounted items, even online calculators or order trackers. Unlike promotional pages, this might neither bring you many new visitors, nor improve your ratings but is more than nothing.

One more potential traffic trigger for company sites are news sections. Here you can include news about past and coming events, post reports about various activities, announce new undertakings, etc. Some companies even go further – their CEO keeps a blog, where he or she writes in a more informal style about what is going in the company, in the industry as a whole, or in the world in general. These blogs do attract readers, especially if the information is true, rather than the official story.

An alternative way to get fresh free content are RSS feeds. RSS feeds are gaining more and more popularity and with a little bit of searching, you can get free syndicated content for almost any topic you can think of.

2. Bold and Italic Text

When you have lots of text, the next question is how to make the important items stand out from the crowd – for both humans and search engines. While search engines (and their spiders – the programs that crawl the Web and index pages) cannot read text the way humans do, they do have ways of getting the meaning of a piece of text. Headings are one possibility, bold and italic are another way to emphasize a word or a couple of words that are important. Search engines read the and text and get the idea that what is in bold and/or italic is more important than the rest of the text. But do not use bold and italic too much – this will spoil the effect, rather than make the whole page a search engine favorite.

3. Duplicate Content

When you get new content, there is one important issue – is this content original? Because if it is not, i.e. it is stolen from another site, this will get you into trouble. But even if it is not illegal, i.e. you obtained it for free from an article feed, have in mind that you might not be only one on the Web, who has this particular stuff. If you have the rights to do it, you can change the text a little, so it is not an exact copy of another page and cannot be labeled “duplicate content” by search engines. If you don't manage to escape the duplicate content filter that search engines have imposed recently in their attempts to filter stolen, scrapped, or simply copied contents, your pages could be removed from search results!

Duplicate content became an issue when tricky webmasters started making multiple copies of the same page (under a different name) in order to fool search engines that they have more content than they actually do. As a result of this malpractice, search engines responded with a duplicate content filter that removes suspicious pages. Unfortunately, this filter sometimes removes quite legitimate pages, like product descriptions given from a manufacturer to all its resellers, which must be kept exactly the same.

You see, duplicate content can be a serious problem. But it is not an obstacle that cannot be overcome. First, you need to periodically check the Web for pages that are similar to yours. You can use http://copyscape.com. If you identify pages that are similar to yours (and it is not you who have illegitimately copied them), you could notify the webmaster of the respective site(s) to remove them. Also, you could change a little the text on your site, hoping that this way you will avoid the duplicate content penalty. Even with product descriptions, you can add commentary or opinion on the same page and this could be a way out.

Try the Similar Page Checker to check the similarity between two URLs.

VI. Visual Extras and SEO

As already mentioned, search engines have no means to index directly extras like images, sounds, flash movies, javascript. Instead, they rely on your to provide meaningful textual description and based on it they can index these files. In a sense, the situation is similar to that with text 10 or so years ago – you provide a description in the metatag and search engines uses this description to index and process your page. If technology advances further, one day it might be possible for search engines to index images, movies, etc. but for the time being this is just a dream.

1. Images

Images are an essential part of any Web page and from a designer point of view they are not an extra but a most mandatory item for every site. However, here designers and search engines are on two poles because for search engines every piece of information that is buried in an image is lost. When working with designers, sometimes it takes a while to explain to them that having textual links (with proper anchor text) instead of shining images is not a whim and that clear text navigation is really mandatory. Yes, it can be hard to find the right balance between artistic performance and SEO-friendliness but since even the finest site is lost in cyberspace if it cannot be found by search engines, a compromise to its visual appearance cannot be avoided.

With all that said, the idea is not to skip images at all. Sure, nowadays this is impossible because the result would be a most ugly site. Rather the idea is that images should be used for illustration and decoration, not for navigation or even worse – for displaying text (in a fancy font, for example). And the most important – in the attribute of the tag, always provide a meaningful textual description of the image. The HTML specification does not require this but search engines do. Also, it does not hurt to give meaningful names to the image files themselves rather than name them image1.jpg, image2.jpg, imageN.jpg. For instance, in the next example the image file has an informative name and the alt provides enough additional information: . Well, don't go to extremes like writing 20-word tags for 1 pixel images because this also looks suspicious and starts to smell like keyword-stuffing.

2. Animation and Movies

The situation with animation and movies is similar to that with images – they are valuable from a designer's point of view but are not loved by search engines. For instance, it is still pretty common to have an impressive Flash introduction on the home page. You just cannot imagine what a disadvantage with search engines this is – it is a number one rankings killer! And it gets even worse, if you use Flash to tell a story that can be written in plain text, hence crawled and indexed by search engines. One workaround is to provide search engines with a HTML version of the Flash movie but in this case make sure that you have excluded the original Flash movie from indexing (this is done in the robots.txt file but the explanation of this file is not a beginners topic and that is why it is excluded from this tutorial), otherwise you can be penalized for duplicate content.

There are rumors that Google is building a new search technology that will allow to search inside animation and movies and that the .swf format will contain new metadata that can be used by search engines, but until then, you'd better either refrain from using (too much) Flash, or at least provide a textual description of the movie (you can use an tag to describe the movie).

3. Frames

It is a good news that frames are slowly but surely disappearing from the Web. 5 or 10 years ago they were an absolute hit with designers but never with search engines. Search engines have difficulties indexing framed pages because the URL of the page is the same, no matter which of the separate frames is open. For search engines this was a shock because actually there were 3 or 4 pages and only one URL, while for search engines 1 URL is 1 page. Of course, search engines can follow the links to the pages in the frameset and index them but this is a hurdle for them.

If you still insist on using frames, make sure that you provide a meaningful description of the site in the

tag. The following example is not for beginners but even if you do not understand everything in it, just remember that the tag is the place to provide an alternative version (or at least a short description) of your site for search engines and users whose browsers do not support frames. If you decide to use the tag, maybe you'd like to read more about it before you start using it.

Example:
This site is best viewed in a browser that supports frames.
Welcome to our site for prospective dog adopters! Adopting a homeless dog is a most noble deed that will help save the life of the poor creature.

4. JavaScript

This is another hot potato. It is known by everybody that pure HTML is powerless to make complex sites with a lot of functionality (anyway, HTML was not intended to be a programming languages for building Web applications, so nobody expects that you can use HTML to handle writing to a database or even for storing session information) as required by today's Web users and that is why other programming languages (like JavaScript, or PHP) come to enhance HTML. For now search engines just ignore JavaScript they encounter on a page. As a result of this, first if you have links that are inside the JavaScript code, chances are that they will not be spidered. Second, if JavaScript is in the HTML file itself (rather than in an external .js file that is invoked when necessary) this clutters the html file itself and spiders might just skip it and move to the next site. Just for your information, there is a

Saturday, June 13, 2009

What Happens When You Send Email?

You typed in your message and clicked on the Send button in your mailer. What happens next? How does email find its way to the final recipient?

Outgoing SMTP Server
Your Internet Service Provider has a special computer called mail server, which is responsible for collecting mail from their customers. When you click the send button, your computer connects to the mail server and transmits the message along with the list of recipients.

The message is transmitted with Simple Mail Transfer Protocol (SMTP). SMTP is a language that your mail program uses to speak to the mail server. SMTP lets your mail program specify the list of recipients and the text of the message.

The mail server that understands SMTP is frequently called Outgoing SMTP Server.

Destination SMTP Server
Your destination recipient also has an Internet Service Provider. They have a mail server too.

When your SMTP server decides that it's time to send your email, it connects to the destination mail server and transmits the message to it. They use SMTP to speak to each other. Therefore the destination mail server is frequently called Destination SMTP Server.

Destination SMTP server stores the email message until your recipient decides to check if some email arrived.

Incoming POP3 Server
When the recipient decides to check the email, his email program connects to the destination SMTP server and retrieves mail from the server. This time, they use Post Office Protocol Version 3 (POP3), which is a language used to receive mail. Therefore, this server is usually called Incoming POP3 Server.

Destination SMTP server for you is Incoming POP3 server for your recipient. When your recipient replies, the situation reverses - now your mail server becomes a Destination SMTP server from the viewpoint of the reply sender.

POP3 service is not necessary. There are other means to access mail on the destination mail server. For example, you can use Web browser to access Hotmail® or Yahoo Mail without downloading all the mail to your computer.

MX Records
One thing remains unclear. There are millions of SMTP servers all around the world. How your SMTP server knows where to send the message?

This information is stored in so called MX records. MX is not an abbreviation, it doesn't mean anything. The MX record shows which destination SMTP server must be used for the specific email address. For example, the MX record for aysoft@aysoft.com points to the server called slim.aysoft.com.

The database of MX records is maintained by a network called Domain Name Service (DNS). To get access to the MX records, you must have access to the DNS server and have a permission to retrieve MX records.

You can look up MX records for any e-mail address with special software, such as AY Spy.

Relay
As you might have noticed, there are two SMTP servers involved in sending each message. Your outbound SMTP server is working as a relay - it accepts your message and relays it to the other server.

When Internet just started, any SMTP server used to work as relays - the message used to be passed from server to server freely. As the email system got abused by spammers, fewer and fewer servers were working this way. Now all open relays are closed. The only server that will relay for you is the one given to you by your Internet Service Provider.

It is also possible to bypass all relays and send email directly to the destination server. For more details see our article about direct send.

Friday, June 12, 2009

my e-mail sending project in java image attachment sourse code

package mailexamples;

import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Date;
import java.util.Properties;

import javax.activation.DataHandler;
import javax.activation.DataSource;
import javax.activation.FileDataSource;
import javax.activation.URLDataSource;
import javax.mail.Authenticator;
import javax.mail.BodyPart;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeBodyPart;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;

public class HtmlJavaMail {

public static void main(String[] args) {
HtmlJavaMail mjm = new HtmlJavaMail();
mjm.sendMail();
}

public HtmlJavaMail() {}

public void sendMail() {
Properties props = new Properties();

props.put("mail.smtp.host", "mail.syerra.com");
props.put("mail.debug","true");
Session session = Session.getDefaultInstance(props, new ForcedAuthenticator());

Message message = new MimeMessage(session);

try {
message.setFrom(new InternetAddress("sujeev@theoturnermedia.com",
"sujeeve"));

message.setRecipient(Message.RecipientType.TO, new InternetAddress(
"kashun@theoturnermedia.com"));

message.setSubject("subject");

MimeMultipart multipart=new MimeMultipart();

BodyPart msgBodyPart=new MimeBodyPart();
msgBodyPart.setContent("

Hi!

","text/html");

BodyPart embedImage=new MimeBodyPart();
DataSource ds=new URLDataSource(new URL("http://www.theoturnermedia.com/bulkemail/cri.JPG"));
// DataSource ds=new FileDataSource("C:/Documents and Settings/Anura/Desktop/New Folder (4)/opti.JPG");
embedImage.setDataHandler(new DataHandler(ds));
embedImage.setHeader("Content-ID","");

// multipart.addBodyPart(msgBodyPart);
multipart.addBodyPart(embedImage);

message.setContent(multipart);

message.setSentDate(new Date());

Transport.send(message);

} catch (AddressException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (MessagingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

}

class ForcedAuthenticator extends Authenticator {
public PasswordAuthentication getPasswordAuthentication() {
return new PasswordAuthentication("sujeev@theoturnermedia.com","sujeev123");

}
}
}

sujeev